The rise of short-term rentals, and of the platforms that enable them, has changed urban landscapes and housing markets in metropolitan areas worldwide. Platforms such as Airbnb were originally conceived to support a mutually beneficial relationship: connecting homeowners with travelers looking for affordable accommodations. However, once people recognized the profitability of unregulated short-term rentals, the practice quickly became an investment scheme that transformed the hospitality industry and accelerated the housing crisis.
The City of Toronto is a vibrant mosaic of neighbourhoods, each uniquely defined by its people and its topology, among other characteristics. As such, each neighbourhood has experienced the impacts of short-term rentals to a different degree. In recent years, the local government has steadily increased regulation of short-term rentals in hopes of reducing their impact on the housing crisis. This report is motivated accordingly and aims to study the dynamics of short-term rental trends from a spatial perspective by exploring the diversity of Airbnb listings across neighbourhoods in Toronto.
In particular, this report examines Price per Night as the main variable of interest (i.e., the dependent variable). The analysis uses Airbnb listing data for Toronto as of September 5, 2024 and analyzes various covariates and their impact on the dependent variable.
This report relies on data from four distinct sources. See Section 2.2 for data processing steps.
Airbnb Listing Data: sourced from Inside Airbnb, the dataset is a snapshot of current listings, and includes fields such as listing and host IDs, accommodation type and price per night. Note the following assumptions associated with this data source:
Income Data: sourced from the Toronto neighbourhood statistics data used for Homework 3. This data is current to 2023 and on a neighbourhood grain. For each neighbourhood, the median income and population fields are used.
Toronto Open Data: sourced from the City of Toronto’s Open Data Portal, two files (both projected to WGS 84) are used:
Landmark Data: sourced by geocoding top 10 Toronto Landmarks from TripAdvisor using a Google API Key.
Prior to joining the above-stated data sources and performing any spatial computations, the following data pre-processing steps are carried out:
sf objects using the 4326 CRS/WGS 84 projection.
Buffers: for TTC Stops and Landmarks, spatial buffers are created with radii of 1 kilometer and 2 kilometers respectively. Then, using a spatial intersection, an Airbnb listing is defined to be near a subway station or a landmark if it falls within the corresponding buffer.
Feature Engineering: the final dataset is obtained by aggregating on the Neighbourhood field (or using pre-aggregated fields) to obtain:
All monetary amounts are assumed to be in Canadian dollars. The resultant geospatial data type is areal. A preview of the dataset can be found in Table 1.
| Neighbourhood | Median Income | Population | Avg. Price per Night | Number of Listings | Number of Listings Near Subway | Number of Listings Near Landmark | Number of Listings per Capita (per 1,000 residents) | Percent of Listings Near Subway | Percent of Listings Near Landmark |
|---|---|---|---|---|---|---|---|---|---|
| Agincourt North | 91 | 30,280 | 63.38 | 52 | 0 | 0 | 1.7173 | 0.0000 | 0.0000 |
| Agincourt South-Malvern West | 88 | 21,990 | 68.95 | 122 | 0 | 0 | 5.5480 | 0.0000 | 0.0000 |
| Alderwood | 130 | 11,900 | 107.46 | 70 | 41 | 0 | 5.8824 | 0.5857 | 0.0000 |
| Annex | 144 | 29,180 | 193.28 | 689 | 689 | 689 | 23.6121 | 1.0000 | 1.0000 |
| Banbury-Don Mills | 119 | 26,910 | 179.98 | 81 | 54 | 23 | 3.0100 | 0.6667 | 0.2840 |
| Bathurst Manor | 113 | 15,435 | 118.12 | 110 | 109 | 0 | 7.1267 | 0.9909 | 0.0000 |
Table 1: Preview of First 6 Rows of Cleaned Dataset
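The buffer and feature engineering steps described above can be illustrated with a minimal sketch. It is not the report's actual pipeline (which works with sf objects); it swaps the spatial buffer and intersection for an equivalent great-circle distance check, and all function names, field names and coordinates below are hypothetical. The per capita field is computed per 1,000 residents, which is consistent with the values shown in Table 1 (e.g., 52 listings over a population of 30,280 gives 1.7173).

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS 84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_near(lat, lon, sites, radius_km):
    """True if the point lies within radius_km of any site (inside a buffer)."""
    return any(haversine_km(lat, lon, s_lat, s_lon) <= radius_km
               for s_lat, s_lon in sites)


def neighbourhood_features(listings, population, stations, landmarks):
    """Aggregate listing-level rows to one feature record per neighbourhood."""
    groups = defaultdict(list)
    for row in listings:
        groups[row["neighbourhood"]].append(row)

    features = {}
    for name, rows in groups.items():
        n = len(rows)
        # 1 km buffer for subway stations, 2 km buffer for landmarks
        near_subway = sum(is_near(r["lat"], r["lon"], stations, 1.0) for r in rows)
        near_landmark = sum(is_near(r["lat"], r["lon"], landmarks, 2.0) for r in rows)
        features[name] = {
            "avg_price": sum(r["price"] for r in rows) / n,
            "n_listings": n,
            "n_near_subway": near_subway,
            "n_near_landmark": near_landmark,
            "listings_per_1000": 1000 * n / population[name],
            "pct_near_subway": near_subway / n,      # stored as a proportion (0 to 1)
            "pct_near_landmark": near_landmark / n,  # stored as a proportion (0 to 1)
        }
    return features


# Hypothetical toy inputs (approximate coordinates, for illustration only)
stations = [(43.6453, -79.3806)]
landmarks = [(43.6426, -79.3871)]
listings = [
    {"neighbourhood": "Toy A", "lat": 43.6460, "lon": -79.3800, "price": 100.0},
    {"neighbourhood": "Toy A", "lat": 43.7800, "lon": -79.2500, "price": 200.0},
]
population = {"Toy A": 2000}

features = neighbourhood_features(listings, population, stations, landmarks)
print(features["Toy A"])
```

A distance check against point sites is equivalent to intersecting with circular buffers around those sites, which is why it can stand in for the buffer step in this sketch.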
First, consider the spatial map in Figure 1, which details the Average Price per Night across neighbourhoods in Toronto as of September 5, 2024. Locations of TTC Subway Stations are denoted by black dots and locations of the Top 10 Landmarks are denoted by purple dots on the map.
Based on the last note, a Boolean variable is included in the features to indicate whether the neighbourhood is one of Bridle Path-Sunnybrook-York Mills or Scarborough Village (i.e., outlier). This may be a pivotal feature when fitting regression models as it will allow the model to separate out the “outliers” and better generalize for the remaining 138 neighbourhoods.
Figure 1: Average Price per Night across Toronto and Notable Locations
Prior to computing Global Moran’s I, consider the following types of adjacency matrix, each of which is suited to a different study context:
The connectivity of neighbourhoods based on various adjacency matrices is illustrated in Figure 2. Given that the Average Price per Night is the result of a somewhat arbitrary aggregation to neighbourhood level, and that patterns in price might “transcend” neighbourhood boundaries, a distance-based adjacency matrix might be the most appropriate.
Figure 2: Various Adjacency Matrices
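To make the distance-based and k-nearest-neighbour constructions concrete, the following is a minimal sketch on hypothetical planar centroids. The function names and the threshold are illustrative assumptions, not the settings used in this report, and Queen contiguity is omitted because it requires polygon boundaries rather than centroids.

```python
import math


def distance_band_neighbours(coords, threshold):
    """Neighbours of i are all other points within `threshold` of point i."""
    n = len(coords)
    return {
        i: [j for j in range(n)
            if j != i and math.dist(coords[i], coords[j]) <= threshold]
        for i in range(n)
    }


def knn_neighbours(coords, k):
    """Neighbours of i are its k closest points (not necessarily symmetric)."""
    n = len(coords)
    result = {}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(coords[i], coords[j]))
        result[i] = others[:k]
    return result


def row_standardise(neighbours):
    """Turn neighbour lists into weights that sum to 1 across each row."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbours.items()}


# Hypothetical centroids in arbitrary planar units
centroids = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]

band = distance_band_neighbours(centroids, threshold=2.0)
knn = knn_neighbours(centroids, k=1)
weights = row_standardise(band)

print("distance band:", band)
print("kNN (k=1):", knn)
print("row-standardised:", weights)
```

In this toy example the last centroid has no neighbours under the distance band but still receives one under kNN, which illustrates the trade-off between the two schemes: a distance threshold can leave isolated regions, while kNN guarantees every region the same number of neighbours at the cost of asymmetric links.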
The Moran’s I statistics for the above adjacency matrices, computed using their corresponding weight matrices, are detailed in Table 2. The distance-based weights result in the highest Moran’s I = \(0.217\). All weight matrices yield a positive Moran’s I, indicating positive spatial autocorrelation: similar values of price per night are clustered together. All p-values are below \(0.05\) and are thus considered statistically significant. In other words, the null hypothesis of no spatial autocorrelation is rejected.
| Weight Matrix | Moran’s I | Expectation | Variance | p-value |
|---|---|---|---|---|
| Queen | 0.2138924 | -0.007194245 | 0.002386915 | 3.015996e-06 |
| Distance | 0.2170340 | -0.007194245 | 0.002980003 | 1.999467e-05 |
| kNN3 | 0.1874309 | -0.007194245 | 0.004072117 | 1.144503e-03 |
| kNN6 | 0.1421564 | -0.007194245 | 0.002044641 | 4.784123e-04 |
Table 2: Results of Moran’s Test for Various Weight Matrices
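The entries of Table 2 can be related to one another with a short check. Under the null hypothesis the expectation of Moran’s I is \(-1/(n-1)\), which for the \(n = 140\) neighbourhoods gives the value \(-0.007194245\) reported in every row. The sketch below assumes that the p-values come from a one-sided ("greater") test under a normal approximation; this assumption is not stated in the report, but it appears consistent with the reported figures.

```python
import math


def moran_z_and_p(i_obs, expectation, variance):
    """Standard deviate and upper-tail normal p-value for an observed Moran's I."""
    z = (i_obs - expectation) / math.sqrt(variance)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal Z
    return z, p


n = 140                      # number of Toronto neighbourhoods in the dataset
expected_i = -1 / (n - 1)    # expectation of Moran's I under the null

# (Moran's I, variance) as reported in Table 2
table2 = {
    "Queen": (0.2138924, 0.002386915),
    "Distance": (0.2170340, 0.002980003),
    "kNN3": (0.1874309, 0.004072117),
    "kNN6": (0.1421564, 0.002044641),
}

results = {}
for name, (i_obs, variance) in table2.items():
    results[name] = moran_z_and_p(i_obs, expected_i, variance)
    z, p = results[name]
    print(f"{name:8s} z = {z:.4f}  p = {p:.6e}")
```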
The distance-based weight matrix is used from here on. To understand the spatial lags at which spatial autocorrelation exists, consider the correlogram shown in Figure 3. Up to and including the fourth lag, the correlogram indicates positive spatial autocorrelation. Beyond the fourth lag, Moran’s I hovers around 0, indicating no spatial autocorrelation or only very limited negative spatial autocorrelation.
Figure 3: Correlogram of Spatial Lags for Distance Based Weights
For an empirical assessment of the significance of Moran’s I, consider a permutation test via Monte-Carlo simulation with 9999 replicates. The results of the test are shown in Figure 4, where the vertical red line indicates the observed Moran’s I value of the data. This test yields a p-value \(=0.0012\), which is statistically significant.
Figure 4: Permutation Test of Moran’s I via Monte-Carlo Simulation
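The logic of the permutation test can be sketched as follows. This is a toy illustration on a hypothetical chain of regions, not the computation behind Figure 4: values are shuffled across regions, Moran’s I is recomputed for each shuffle, and the p-value is the share of simulated statistics at least as large as the observed one (counting the observed value itself as one draw, which is an assumed convention here).

```python
import random


def morans_i(values, weights):
    """Global Moran's I; `weights` maps i -> {j: w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * dev[i] * dev[j]
                for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in dev)


def permutation_test(values, weights, n_sim=999, seed=0):
    """One-sided Monte-Carlo p-value for positive spatial autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # break any link between value and location
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


def chain_weights(n):
    """Binary adjacency for n regions arranged in a line (hypothetical layout)."""
    return {i: {j: 1.0 for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


# Hypothetical prices that increase smoothly along the chain
prices = [60.0, 70.0, 85.0, 90.0, 120.0, 150.0, 155.0, 190.0]
observed, p_value = permutation_test(prices, chain_weights(len(prices)))
print(f"observed I = {observed:.4f}, permutation p-value = {p_value:.4f}")
```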
Overall, the above analysis indicates positive spatial autocorrelation of Average Price per Night across Toronto neighbourhoods.
See Figure 5 for a scatterplot of Local Moran’s I corresponding to each neighbourhood in Toronto. The slope of the solid line reflects the Global Moran’s I estimate (\(\approx 0.217\)) indicating positive spatial autocorrelation across Toronto. Examining the four quadrants:
Figure 5: Moran Scatterplot for Distance Based Weights
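As a hedged illustration of how the quadrants of a Moran scatterplot are assigned, the sketch below centres each value, computes its spatial lag (the weighted average of neighbouring deviations) and labels the pair as High-High, Low-Low, High-Low or Low-High. It assumes row-standardised weights and scales the local statistic by the mean squared deviation; the toy values and function name are hypothetical.

```python
def local_moran(values, weights):
    """Return (local I_i, quadrant label) per region for row-standardised weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # mean squared deviation
    out = []
    for i in range(n):
        lag = sum(w * z[j] for j, w in weights[i].items())  # spatial lag of z
        own = "H" if z[i] > 0 else "L"    # region's own value vs the mean
        nbr = "H" if lag > 0 else "L"     # neighbours' average vs the mean
        out.append((z[i] * lag / m2, own + nbr))
    return out


# Hypothetical chain of four regions with row-standardised adjacency weights
weights = {
    0: {1: 1.0},
    1: {0: 0.5, 2: 0.5},
    2: {1: 0.5, 3: 0.5},
    3: {2: 1.0},
}
values = [1.0, 2.0, 3.0, 4.0]

local = local_moran(values, weights)
for i, (stat, quadrant) in enumerate(local):
    print(i, round(stat, 3), quadrant)

# With row-standardised weights, the mean of the local statistics
# equals the global Moran's I (the slope of the scatterplot line).
global_i = sum(stat for stat, _ in local) / len(local)
print("global I:", round(global_i, 3))
```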
The map in Figure 6 further illustrates Local Moran’s I clusters. The following spatial patterns arise:
Figure 6: Local Moran’s I Clusters
The map in Figure 7 illustrates the Local Getis-Ord G* value for each neighbourhood. The following patterns can be noted:
Interestingly, a similar hotspot does not arise near the Bridle Path-Sunnybrook-York Mills neighbourhood. This might be because numerous adjacent neighbourhoods with moderately priced listings “dilute” the impact of its significantly higher Average Price per Night.
Figure 7: Local Getis-Ord G*
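For reference, the Getis-Ord \(G_i^*\) statistic compares the sum of values in a region's neighbourhood (including the region itself) with what would be expected given the overall mean, and reports the difference as a standard deviate. The sketch below assumes binary weights and a hypothetical chain of regions; it is meant to show the mechanics rather than reproduce Figure 7.

```python
import math


def getis_ord_g_star(values, neighbours):
    """Local G* z-scores with binary weights; each region counts as its own neighbour."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        members = set(neighbours[i]) | {i}   # the "star" variant includes i itself
        w_sum = len(members)                 # sum of binary weights
        w_sq_sum = len(members)              # sum of squared binary weights
        numerator = sum(values[j] for j in members) - mean * w_sum
        # Note: the denominator is zero if a region neighbours every other region
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical chain: three expensive regions followed by three cheap ones
values = [10.0, 10.0, 10.0, 1.0, 1.0, 1.0]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

g_star = getis_ord_g_star(values, neighbours)
for i, g in enumerate(g_star):
    print(i, round(g, 3))
```

Large positive scores mark hotspots and large negative scores mark coldspots. A single extreme region surrounded by moderate neighbours contributes only one term to each neighbourhood sum, which is one way to see how its effect can be diluted.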
The exploration in the previous section illustrates the presence of spatial autocorrelation in the Average Price per Night for Airbnb listings across neighbourhoods of Toronto. Next, consider modelling the dependent variable as a function of the independent variables defined in Section 2.2 (Data Processing). This section fits linear, spatial and conditional autoregressive models and presents the fitted parameter estimates:
Note that all autoregressive modelling assumes distance-based weights. The specific combination of covariates for each model was selected, among other combinations, to reduce the value of Moran’s I for the fitted residuals. That is, the spatial autocorrelation in the fitted residuals is minimized so that the model accounts for as much of the spatial autocorrelation in Average Price per Night as possible. See Table 3 for a summary of the Moran’s tests:
| Model | Moran’s I | Expectation | Variance | p-value |
|---|---|---|---|---|
| Linear | 0.1100039726 | -0.007194245 | 0.002966119 | 0.01570171 |
| SAR Lag | 0.0039558354 | -0.007194245 | 0.002959692 | 0.41880433 |
| SAR Error | 0.0056231746 | -0.007194245 | 0.002959741 | 0.40687187 |
| SAR Lag + Error | 0.0006586813 | -0.007194245 | 0.002959586 | 0.44261215 |
| CAR | 0.0058175161 | -0.007194245 | 0.002961708 | 0.40551714 |
Table 3: Results of Moran’s Test for Various Regression Models
Based on its statistically significant Moran’s I, the linear model evidently still has spatial autocorrelation that is unexplained by the selected covariates. By comparison, the autoregressive models have lower Moran’s I values, indicating limited spatial autocorrelation in the residuals, and their p-values are not statistically significant, so the null hypothesis of no residual spatial autocorrelation cannot be rejected. See Table 4 for the estimated spatial parameters for the autoregressive models:
| Model | \(\hat{\rho}\) (p-value) | \(\hat{\lambda}\) (p-value) | \(\hat{\sigma}^2\) | LR Test Value | AIC |
|---|---|---|---|---|---|
| SAR Lag | 0.2371 (0.004) | N/A | 1043.8 | 8.3128 | 1388.0 |
| SAR Error | N/A | 0.25494 (0.051) | 1240.6 | 3.8166 | 1410.5 |
| SAR Lag+Error | 0.22985 (0.039) | 0.01906 (0.914) | 1044.4 | 8.3234 | 1390.0 |
| CAR | N/A | 0.25161 (0.158) | 1097.6 | 1.9892 | 1394.3 |
Table 4: Fitted Spatial Parameters and Model Metrics for Various Autoregressive Models
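For orientation, the spatial parameters in Table 4 can be read against the usual textbook forms of the simultaneous autoregressive models, written here under the assumption that the report follows the standard conventions, with \(W\) the distance-based weight matrix, \(X\) the covariates and \(\epsilon \sim N(0, \sigma^2 I)\):

\[ \text{SAR Lag:} \quad y = \rho W y + X\beta + \epsilon \]

\[ \text{SAR Error:} \quad y = X\beta + u, \quad u = \lambda W u + \epsilon \]

\[ \text{SAR Lag + Error:} \quad y = \rho W y + X\beta + u, \quad u = \lambda W u + \epsilon \]

In these forms, \(\rho\) measures how strongly a neighbourhood's price depends on the prices of its neighbours, while \(\lambda\) measures spatial dependence in the unexplained part of the price.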
The CAR model does not have a statistically significant spatial parameter \(\lambda\) and has a very low LR Test value compared to the SAR models. Similarly, the SAR model with both Spatial Lag and Error does not have a statistically significant spatial parameter \(\lambda\), although it has a high LR Test value. On the other hand, the SAR model with Spatial Lag has a statistically significant spatial parameter \(\rho\) (p-value \(= 0.004\)), while the spatial parameter \(\lambda\) of the SAR model with Spatial Error is only marginally significant (p-value \(= 0.051\)). Between these two models, the SAR model with Spatial Lag is selected to best model Average Price per Night while accounting for spatial autocorrelation since:
In the previous section, a SAR model with Spatial Lag was selected to best model the dependent variable Average Price per Night. See Table 5 for the fitted coefficients. Of the six fitted coefficients, the Intercept, Median Income and Is Outlier Neighbourhood are statistically significant at the \(0.05\) level. Number of Listings near Subway is marginally significant (p-value \(\approx 0.055\)), while Number of Listings (p-value \(\approx 0.15\)) and Percent of Listings near Landmarks (p-value \(\approx 0.50\)) are not statistically significant. The estimate for the Is Outlier Neighbourhood coefficient is much higher in magnitude than the other coefficients, which makes intuitive sense because this variable is used to account for the stark difference in Average Price per Night for the two neighbourhoods (i.e., Bridle Path-Sunnybrook-York Mills, Scarborough Village) when compared to the average neighbourhood.
| Coefficient | Estimate | p-value |
|---|---|---|
| Intercept | 41.163510 | 0.005794 |
| Number of Listings | -0.114544 | 0.148167 |
| Number of Listings near Subway | 0.150386 | 0.054734 |
| Percent of Listings near Landmarks | 6.269885 | 0.501103 |
| Median Income | 0.395088 | 3.295e-05 |
| Is Outlier Neighbourhood | 334.650708 | < 2.2e-16 |
Table 5: Fitted Coefficients of SAR Model with Spatial Lag
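When reading Table 5, it may help to recall that in a spatial lag model a change in a covariate also feeds back through neighbouring prices. Under the assumption that the weight matrix is row-standardised (not confirmed in this report), the average total effect of a covariate is its coefficient divided by \(1 - \hat{\rho}\). The sketch below applies this to the reported estimates purely as an illustration; the function name is hypothetical.

```python
def total_impact(beta, rho):
    """Average total effect of a covariate in a spatial lag model,
    assuming a row-standardised weight matrix."""
    return beta / (1.0 - rho)


rho_hat = 0.2371  # estimated spatial lag parameter from Table 4

# Coefficient estimates as reported in Table 5
coefficients = {
    "Number of Listings": -0.114544,
    "Number of Listings near Subway": 0.150386,
    "Percent of Listings near Landmarks": 6.269885,
    "Median Income": 0.395088,
}

impacts = {name: total_impact(beta, rho_hat) for name, beta in coefficients.items()}
for name, value in impacts.items():
    print(f"{name}: direct estimate {coefficients[name]:.4f}, total {value:.4f}")
```

Because \(\hat{\rho}\) is positive, each total effect is larger in magnitude than the corresponding coefficient, by a factor of \(1/(1 - 0.2371) \approx 1.31\).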
Consider again the SAR model with Spatial Lag, but fitted without the Is Outlier Neighbourhood covariate. See Table 6 for a comparison of Moran’s I tests of the residuals. Although neither Moran’s I value is statistically significant (i.e., the null hypothesis of no residual autocorrelation is not rejected), the value is higher after removing the covariate, indicating more residual autocorrelation. See Table 7 for a comparison of the estimated spatial parameters and model metrics. Removing the covariate yields a spatial parameter \(\rho\) that is not statistically significant, as well as a much higher residual variance, AIC and SSE, and a much lower LR Test value.
| Includes Outlier Covariate | Moran’s I | Expectation | Variance | p-value |
|---|---|---|---|---|
| Yes | 0.0039558354 | -0.007194245 | 0.002959692 | 0.41880433 |
| No | 0.007760144 | -0.007194245 | 0.002392637 | 0.37990000 |
Table 6: Results of Moran’s Test for SAR model with Spatial Lag, with and without Outlier Covariate
| Includes Outlier Covariate | \(\hat{\rho}\) (p-value) | \(\hat{\sigma}^2\) | LR Test Value | AIC | SSE |
|---|---|---|---|---|---|
| Yes | 0.23710 (0.004) | 1043.8 | 8.3128 | 1388.0 | 146129.5 |
| No | 0.15955 (0.157) | 2570.6 | 2.0008 | 1511.3 | 359879.1 |
Table 7: Fitted Spatial Parameters and Model Metrics for SAR model with Spatial Lag, with and without Outlier Covariate
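Two relationships among the columns of Table 7 can be checked numerically. The sketch assumes that the reported p-value for \(\hat{\rho}\) is the likelihood ratio test p-value on one degree of freedom, and that \(\hat{\sigma}^2\) is the maximum likelihood estimate \(\text{SSE}/n\) with \(n = 140\); neither is stated explicitly in the report, but both appear consistent with the reported values.

```python
import math


def lr_p_value_1df(lr_stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For X ~ chi-square(1): P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(lr_stat / 2))


n = 140  # number of neighbourhoods

# LR Test Value and SSE as reported in Table 7
table7 = {
    "Yes": {"lr": 8.3128, "sse": 146129.5},
    "No": {"lr": 2.0008, "sse": 359879.1},
}

summary = {}
for label, row in table7.items():
    summary[label] = {
        "p_value": lr_p_value_1df(row["lr"]),
        "sigma2": row["sse"] / n,
    }
    print(f"outlier covariate = {label}: "
          f"p = {summary[label]['p_value']:.3f}, "
          f"sigma^2 = {summary[label]['sigma2']:.1f}")
```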
For a further comparison, consider the maps in Figure 8 of the fitted residuals for each model. Both maps use the same colour scale to map residuals. Overall, the residuals in Figure 8a are of much lower absolute magnitude (i.e., \(|\hat{\epsilon}_a| \leq \$100\)) and very similar in colour, while the residuals in Figure 8b have more variance (i.e., \(|\hat{\epsilon}_b| \leq \$350\)). Notably, the residuals in Figure 8b for the two outlier neighbourhoods and some central neighbourhoods are darker in hue than in Figure 8a. This suggests that including a Boolean covariate to indicate the two outlier neighbourhoods allowed the model to better fit the remaining 138 neighbourhoods, thus reducing the residuals and SSE.
Figure 8a: Map of Residual Average Price per Night, With Outlier Covariate
Figure 8b: Map of Residual Average Price per Night, Without Outlier Covariate
As a final step, consider the map in Figure 9 depicting the fitted values using the SAR model with Spatial Lag outlined in Table 5. Note that this map uses the same colour scale as the map in Figure 1. The Bridle Path-Sunnybrook-York Mills and Scarborough Village neighbourhoods have a fitted Average Price per Night that is much higher than all other neighbourhoods, which aligns with the true values and is due to the inclusion of the Outlier covariate. The majority of the neighbourhoods have fitted values between $100 and $200, which agrees with the true values seen in Figure 1. Some neighbourhoods with a true Average Price per Night in the $200 range were successfully identified, namely Waterfront Communities-The Island and a few more in Midtown Toronto that intersect with Yonge St.
Figure 9: Map of Fitted Average Price per Night
To conclude, the above analysis found significant evidence of positive spatial autocorrelation in the Average Price per Night of Airbnb listings in Toronto as of September 5, 2024. This was established globally (i.e., over the entire study area) and in clusters throughout neighbourhoods in the city. More specifically, listings in neighbourhoods adjacent to Yonge St., near the downtown core and bordering the lake tend to be higher in price. Conversely, listings in suburban neighbourhoods tend to be lower in price. The fitted spatial regression model highlights some interesting relationships between the Average Price per Night and selected covariates. In particular, there appears to be a positive relationship with proximity to TTC Subway Stations and with Median Income, and a slightly inverse (though not statistically significant) relationship with Number of Listings.
Although the analysis uncovered some valuable insights, there are a few limitations. Namely:
To improve upon the analysis, the following next steps could be explored: